Load in data

library(pepdiff)
d <- import_data("test_data/large.csv",
                 gene_id = "molecule_list_name",
                 peptide = "peptide_modified_sequence",
                 treatment = "genotype"
                 )
## Parsed with column specification:
## cols(
##   molecule_list_name = col_character(),
##   protein_description = col_character(),
##   molecule_list_note = col_character(),
##   peptide_modified_sequence = col_character(),
##   molecule_note = col_character(),
##   precursor_mz = col_double(),
##   precursor_charge = col_double(),
##   bio_rep_key = col_character(),
##   tech_rep_key = col_character(),
##   bio_sample_key = col_character(),
##   total_area = col_double(),
##   from_redo = col_logical(),
##   genotype = col_character(),
##   seconds = col_double(),
##   bio_rep = col_double(),
##   tech_rep = col_double(),
##   pep_id = col_character(),
##   tag = col_character()
## )

Check number of missing data points

Includes at tech rep level

Missing data

missing_peptides_plot(d)

Replicate level

times_measured(d)
## # A tibble: 4,352 x 5
## # Groups:   gene_id, peptide, treatment [1,088]
##    gene_id     peptide                      treatment seconds times_measured
##    <chr>       <chr>                        <chr>       <dbl>          <int>
##  1 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK bik1pbl         0              3
##  2 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK bik1pbl       150              3
##  3 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK bik1pbl       300              3
##  4 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK bik1pbl       600              3
##  5 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK Col-0           0              3
##  6 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK Col-0         150              3
##  7 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK Col-0         300              3
##  8 AT1G01540.2 ADFASAAIAT[+80]PPIS[+80]KEIK Col-0         600              3
##  9 AT1G01540.2 ADFASAAIAT[+80]PPISK         bik1pbl         0              3
## 10 AT1G01540.2 ADFASAAIAT[+80]PPISK         bik1pbl       150              3
## # … with 4,342 more rows
times_measured_plot(d)

Look at distribution of quantifications

plot_quant_distributions(d)
## Warning: Removed 588 rows containing non-finite values (stat_density).

plot_quant_distributions(d, log = TRUE)
## Warning: Removed 588 rows containing non-finite values (stat_density).

norm_qqplot(d)
## Warning: Removed 588 rows containing non-finite values (stat_qq).
## Warning: Removed 588 rows containing non-finite values (stat_qq_line).

norm_qqplot(d, log = TRUE)
## Warning: Removed 588 rows containing non-finite values (stat_qq).

## Warning: Removed 588 rows containing non-finite values (stat_qq_line).

Do a comparison

Col-0 0 seconds vs Col-0 150 seconds

r <- compare(d, iters = 1000, metrics=c("bootstrap_t", "wilcoxon", "rank_product"), control = 'Col-0', c_seconds = '0', treatment = 'Col-0', t_seconds = '150')
## Rank Product analysis for unpaired case 
##  
## 
##  done
r
## # A tibble: 544 x 25
##    gene_id peptide `Col-0_150_1` `Col-0_150_2` `Col-0_150_3` `Col-0_0_1`
##    <fct>   <fct>           <dbl>         <dbl>         <dbl>       <dbl>
##  1 AT1G01… ADFASA…      0.0102        0.0120        0.0112      0.0107  
##  2 AT1G01… ADFASA…      0.0120        0.00367       0.00479     0.0140  
##  3 AT1G03… SGSGAG…      0.000471      0.00271       0.000585    0.000471
##  4 AT1G04… SQVDGS…      0.00513       0.00750       0.00474     0.00500 
##  5 AT1G06… SYT[+8…      0.000435      0.000416      0.000828    0.000740
##  6 AT1G08… VTLVPP…      0.0332        0.0290        0.0388      0.0474  
##  7 AT1G09… GEPNVS…      0.00337       0.00471       0.00553     0.00213 
##  8 AT1G09… GEPNVS…      0.00337       0.00474       0.00562     0.00213 
##  9 AT1G09… VLVKGE…      0.00962       0.00512       0.00142     0.00769 
## 10 AT1G09… VLVKGE…      0.0112        0.00598       0.00170     0.00884 
## # … with 534 more rows, and 19 more variables: `Col-0_0_2` <dbl>,
## #   `Col-0_0_3` <dbl>, fold_change <dbl>, `unreplaced_Col-0_150_1` <dbl>,
## #   `unreplaced_Col-0_150_2` <dbl>, `unreplaced_Col-0_150_3` <dbl>,
## #   `unreplaced_Col-0_0_1` <dbl>, `unreplaced_Col-0_0_2` <dbl>,
## #   `unreplaced_Col-0_0_3` <dbl>, treatment_replicates <int>,
## #   control_replicates <int>, bootstrap_t_p_val <dbl>, bootstrap_t_fdr <dbl>,
## #   wilcoxon_p_val <dbl>, wilcoxon_fdr <dbl>, rank_prod_p2_p_val <dbl>,
## #   rank_prod_p1_p_val <dbl>, rank_prod_p2_fdr <dbl>, rank_prod_p1_fdr <dbl>

Look at distribution of calculated Fold Changes

plot_fc(r)

fc_qqplot(r)

plot_fc(r, log = TRUE)

fc_qqplot(r, log = TRUE)

Looks like fold changes are somehwat log-normal around the middle though they have fat tails, which means there are more extreme values than expected by a normal distribution. This means that any method that looks at values in the tails as outliers could give more false positives than we would usually expect.

Running tests

The current analysis performs a range of tests,

  1. Iterative Normal
  2. Bootstrap t-test
  3. Rank Product
  4. Kruskal-Wallis
  5. Wilcox (Mann-Whitney U)

The results contains two versions of the Iterative Normal, one using the Global Mean, the other using the Lowest Observed Value. Given the fat tails we observe this is likely to call false positives

The Bootstrap t-test samples r times from all the test and control values for each peptide (with replacement) making a new ‘test’ and ‘control’, for each new sample it calculates the distance in means as expressed by the quantity t (from the standard ‘t’-test) and builds a distribution for them. At the end of the iterations it checks where the real t lies, and it gives a p-value on the position in that distribution.

The Rank Product is a reworking of the method I tried last time, now that we always have the right number of values I can make the test sample across the peptides before it does the test so that it avoids the pairing issue from before, it will be weak on the samples with lots of replaced values, potentially falling into the same hole. There is also the complication that it only does tests ‘one way up’ so it needs to be looked at twice, one for the hypothesis test > control, once for control > test.

There are also two non-parametric tests, the Kruskal and Wilcox test, which would be considered a standard test where you can’t presume a normal distribution of ratios (these are the standard non-parametric tests that get used almost blindly in these situations).

Plot p-values

The plot below shows the method and the p-value at different numbers of observations. The thing to note is that the Kruskal method does nothing and the Wilcox is close to nothing, the Bootstrap t (which works on one side, showing p-values at 0 as significant, 1 is non-significant) shows a nice spread moving with the fold change, the Iterative Normal does the same, the RP works ok too. calling p-values largely in line with fold change.

plot_result(r)

readr::write_csv(r, "~/Desktop/test.csv")

Test distribution of p-values for the different methods

As a straighforward characteristic of p-values, they should be uniformly distributed. Closer to uniform is best.

p_value_hist(r )

Compare the significant peptides

In this UpSet Plot which compares the sets of significant peptides the bootstrap has the lowest number of overall significant calls (~ 25), the rp next (~ 50) then the iterative normal (~ 80). The Iterative Normal methods call 37 peptides significant uniquely, so are likely to be the most false positive. The rp method calls 18 uniquely, the bootstrap doesn’t call any uniquely. 11 are called by all methods, 35 by at least 3.

On balance, given the distribution of fold changes, the distribution of p-values and the significance call pattern I would say the Bootstrap t-test is the best test as it has statistical power but remains specific. The others give more significant calls which are likely to be more false positive laden, so are going to give more errors downstream. As the size of the larger sets is ~ 2x that of the smaller sets, you are looking at half of the calls being false positives.

compare_calls(r)

comparisons <- data.frame(
  control = c('Col-0','Col-0'),
  c_seconds = c(0,150),
  treatment = c('Col-0','Col-0'),
  t_seconds = c(150, 0)
)

many <- compare_many(d, comparisons, metrics = c("bootstrap_t", "rank_product"))
## Rank Product analysis for unpaired case 
##  
## 
##  done  Rank Product analysis for unpaired case 
##  
## 
##  done
many
## $`Col-0_150-Col-0_0`
## # A tibble: 544 x 23
##    gene_id peptide `Col-0_150_1` `Col-0_150_2` `Col-0_150_3` `Col-0_0_1`
##    <fct>   <fct>           <dbl>         <dbl>         <dbl>       <dbl>
##  1 AT1G01… ADFASA…      0.0102        0.0120        0.0112      0.0107  
##  2 AT1G01… ADFASA…      0.0120        0.00367       0.00479     0.0140  
##  3 AT1G03… SGSGAG…      0.000471      0.00271       0.000585    0.000471
##  4 AT1G04… SQVDGS…      0.00513       0.00750       0.00474     0.00500 
##  5 AT1G06… SYT[+8…      0.000435      0.000416      0.000828    0.000740
##  6 AT1G08… VTLVPP…      0.0332        0.0290        0.0388      0.0474  
##  7 AT1G09… GEPNVS…      0.00337       0.00471       0.00553     0.00213 
##  8 AT1G09… GEPNVS…      0.00337       0.00474       0.00562     0.00213 
##  9 AT1G09… VLVKGE…      0.00962       0.00512       0.00142     0.00769 
## 10 AT1G09… VLVKGE…      0.0112        0.00598       0.00170     0.00884 
## # … with 534 more rows, and 17 more variables: `Col-0_0_2` <dbl>,
## #   `Col-0_0_3` <dbl>, fold_change <dbl>, `unreplaced_Col-0_150_1` <dbl>,
## #   `unreplaced_Col-0_150_2` <dbl>, `unreplaced_Col-0_150_3` <dbl>,
## #   `unreplaced_Col-0_0_1` <dbl>, `unreplaced_Col-0_0_2` <dbl>,
## #   `unreplaced_Col-0_0_3` <dbl>, treatment_replicates <int>,
## #   control_replicates <int>, bootstrap_t_p_val <dbl>, bootstrap_t_fdr <dbl>,
## #   rank_prod_p2_p_val <dbl>, rank_prod_p1_p_val <dbl>, rank_prod_p2_fdr <dbl>,
## #   rank_prod_p1_fdr <dbl>
## 
## $`Col-0_0-Col-0_150`
## # A tibble: 544 x 23
##    gene_id peptide `Col-0_0_1` `Col-0_0_2` `Col-0_0_3` `Col-0_150_1`
##    <fct>   <fct>         <dbl>       <dbl>       <dbl>         <dbl>
##  1 AT1G01… ADFASA…    0.0107      0.0113      0.00979       0.0102  
##  2 AT1G01… ADFASA…    0.0140      0.00668     0.00512       0.0120  
##  3 AT1G03… SGSGAG…    0.000471    0.00194     0.000471      0.000471
##  4 AT1G04… SQVDGS…    0.00500     0.00752     0.00811       0.00513 
##  5 AT1G06… SYT[+8…    0.000740    0.000664    0.00110       0.000435
##  6 AT1G08… VTLVPP…    0.0474      0.0318      0.0410        0.0332  
##  7 AT1G09… GEPNVS…    0.00213     0.00632     0.00447       0.00337 
##  8 AT1G09… GEPNVS…    0.00213     0.00641     0.00452       0.00337 
##  9 AT1G09… VLVKGE…    0.00769     0.00837     0.00317       0.00962 
## 10 AT1G09… VLVKGE…    0.00884     0.00976     0.00377       0.0112  
## # … with 534 more rows, and 17 more variables: `Col-0_150_2` <dbl>,
## #   `Col-0_150_3` <dbl>, fold_change <dbl>, `unreplaced_Col-0_0_1` <dbl>,
## #   `unreplaced_Col-0_0_2` <dbl>, `unreplaced_Col-0_0_3` <dbl>,
## #   `unreplaced_Col-0_150_1` <dbl>, `unreplaced_Col-0_150_2` <dbl>,
## #   `unreplaced_Col-0_150_3` <dbl>, treatment_replicates <int>,
## #   control_replicates <int>, bootstrap_t_p_val <dbl>, bootstrap_t_fdr <dbl>,
## #   rank_prod_p2_p_val <dbl>, rank_prod_p1_p_val <dbl>, rank_prod_p2_fdr <dbl>,
## #   rank_prod_p1_fdr <dbl>
plot_heatmap(many, metric = "bootstrap_t", log = TRUE,  col_order = c("Col-0_150-Col-0_0", "Col-0_0-Col-0_150"))

plot_heatmap(many, metric = "rank_product", log = TRUE,  col_order = c("Col-0_150-Col-0_0", "Col-0_0-Col-0_150"))

plot_pca(d)

plot_kmeans(d)